Exploratory Data Analysis of Red Wine Quality by Xiaolan Yuan

Univariate Plots Section

The data set in this exploratory data analysis contains 1599 red wine samples. Each sample has a quality rating and 11 physicochemical attributes, for 12 original variables in total. We added one more variable, ‘total.acidity’, to represent the sum of the three types of acid in the data set.

The dimensions of the data set are listed below:

## [1] 1599   13

The name and type of each variable are shown below:

## 'data.frame':    1599 obs. of  13 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ total.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
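
The derived variable can be sketched in R as follows. This is a minimal sketch rather than the original code: the data frame name `wine` is an assumption, and it holds only the first ten rows of the three acid columns, copied from the listing above. Note that the listing prints ‘total.acidity’ values equal to ‘fixed.acidity’, so it may be worth confirming on the full data that the sum was actually applied.

```r
# First ten rows of the three acid columns, copied from the str() output.
# The object name `wine` is an assumption made for this sketch.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
)

# Sum of the three acid measurements, as described in the text.
wine$total.acidity <- with(wine, fixed.acidity + volatile.acidity + citric.acid)
print(wine$total.acidity)

# Sanity check: the sum should differ from fixed acidity alone.
same.as.fixed <- isTRUE(all.equal(wine$total.acidity, wine$fixed.acidity))
print(same.as.fixed)
```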

To get a better understanding of the data structure, it is necessary to have a look at the descriptive statistics of each variable.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##    sulphates         density             pH           alcohol     
##  Min.   :0.3300   Min.   :0.9901   Min.   :2.740   Min.   : 8.40  
##  1st Qu.:0.5500   1st Qu.:0.9956   1st Qu.:3.210   1st Qu.: 9.50  
##  Median :0.6200   Median :0.9968   Median :3.310   Median :10.20  
##  Mean   :0.6581   Mean   :0.9967   Mean   :3.311   Mean   :10.42  
##  3rd Qu.:0.7300   3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:11.10  
##  Max.   :2.0000   Max.   :1.0037   Max.   :4.010   Max.   :14.90  
##     quality      total.acidity  
##  Min.   :3.000   Min.   : 4.60  
##  1st Qu.:5.000   1st Qu.: 7.10  
##  Median :6.000   Median : 7.90  
##  Mean   :5.636   Mean   : 8.32  
##  3rd Qu.:6.000   3rd Qu.: 9.20  
##  Max.   :8.000   Max.   :15.90

The summary depicts a rough distribution of values of each variable. We noticed:

Next, we use histograms to get a better sense of the distribution of each attribute of red wine.

Brief introduction: Fixed acidity covers most of the acids in red wine, namely those that do not evaporate readily. The majority of the fixed acid is tartaric acid, which plays an important role in maintaining the chemical stability and color of the wine and in influencing the taste of the finished wine.

Distribution: Fixed acidity has a slightly long-tailed distribution in our data set. Most samples fall between approximately \(6.5\) \(g/dm^3\) and \(9.0\) \(g/dm^3\). There are a few outliers with fixed acidity below \(5\) \(g/dm^3\) or above \(13\) \(g/dm^3\).

Brief introduction: According to Wikipedia, acetic acid in wine, often referred to as volatile acidity (VA) or vinegar taint, can be produced by many wine spoilage yeasts and bacteria. Volatile acidity represents the amount of acetic acid in the wine, which at high levels can lead to an unpleasant vinegar taste.

Distribution: The volatile acidity distribution appears weakly bimodal. One mode is around \(0.4\) \(g/dm^3\) and another is around \(0.6\) \(g/dm^3\). We can also notice a high peak around \(0.5\) \(g/dm^3\). The distribution also has a few outliers above \(1.0\) \(g/dm^3\).

Further interests: It would be interesting to further explore the quality of the red wine samples grouped around the two common volatile acidity levels, i.e. around \(0.4\) \(g/dm^3\) and around \(0.6\) \(g/dm^3\), along with the amount of antimicrobial agent (sulfur dioxide).

Brief introduction: Citric acid is an inexpensive supplement that can be used to boost a wine’s total acidity. It can add aggressive citric flavors to the wine. In the European Union, use of citric acid for acidification is prohibited.

Distribution: The distribution of citric acid concentration resembles an exponential distribution: most red wines contain only a small amount of citric acid. However, there are four tall bars at \(0\) \(g/dm^3\), \(0.02\) \(g/dm^3\), \(0.24\) \(g/dm^3\), and \(0.48\) \(g/dm^3\). The peak at \(0\) \(g/dm^3\) is consistent with the European Union’s prohibition. Most samples have a citric acid concentration between \(0\) \(g/dm^3\) and \(0.7\) \(g/dm^3\).

Further interests: The reasons for the other three peaks are not yet known. Since citric acid is added to boost a wine’s total acidity, we may later explore the underlying reason by looking at the relationship between total acidity and citric acid.

Total acidity is dominated by fixed acidity, so its histogram looks almost the same as the histogram of fixed acidity.

Brief introduction: The variable pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic). Total acidity tells us the concentration of acids present in wine whereas the pH level tells us how intense those acids taste.

Distribution: The distribution of pH is approximately normal with a mean of \(3.311\). Most red wine samples in our data set have pH values from \(3.0\) to \(3.6\).

Further interests: pH indicates how intense the acids taste, and according to the website Understanding Acidity in Wine, this perceived acidity can be reduced by sweetness. We would like to dig further into the relationship between red wine quality and pH within groups of different sweetness levels.

Brief introduction: ‘residual.sugar’ represents the amount of sugar remaining after fermentation stops. From How Basic Wine Characteristics Help You Find Favorites, we know that the level of sweetness is one important aspect of red wine quality.

Distribution: The original distribution in the left plot is highly right-skewed (it has a long right tail), so we applied a log transformation to residual sugar to obtain the right plot. After the transformation, the distribution looks approximately normal. The center is around \(2.1\) \(g/dm^3\). Most samples have residual sugar between \(1.2\) \(g/dm^3\) and \(3.0\) \(g/dm^3\).
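
The effect of the log transformation can be checked numerically. The sketch below is illustrative only: it uses the ten ‘residual.sugar’ values printed in the structure listing rather than the full data set, and the helper `skewness` is a hypothetical function defined here (a simple moment-based estimate), not part of base R.

```r
# Ten residual sugar values copied from the str() output (g/dm^3).
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Moment-based sample skewness; positive values indicate a long right tail.
# `skewness` is a helper defined for this sketch only.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

raw.skew <- skewness(residual.sugar)
log.skew <- skewness(log10(residual.sugar))

# A mean above the median is another sign of a long right tail.
print(c(mean = mean(residual.sugar), median = median(residual.sugar)))
print(c(raw = raw.skew, log10 = log.skew))
```

On these ten values the skewness stays positive after the transformation but becomes smaller, which is the behavior the histograms are meant to show.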

Further interests: Since there are a few samples with extremely high residual sugar, we would like to divide the data set into three groups with residual sugar levels from low to high, and then explore the red wine quality in each group.
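
One way to form such groups is to cut the variable at its terciles. The text does not say how the three groups are defined, so the quantile-based split below is an assumption, and the sample values are again the ten rows from the structure listing.

```r
# Ten residual sugar values copied from the str() output (g/dm^3).
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Tercile boundaries (assumed grouping rule, not stated in the text).
# On data with many tied values the breaks may coincide, and cut() would
# then fail, so check that they are unique before cutting.
breaks <- quantile(residual.sugar, probs = c(0, 1/3, 2/3, 1))
stopifnot(!anyDuplicated(breaks))

sugar.level <- cut(residual.sugar,
                   breaks = breaks,
                   labels = c("low", "mid", "high"),
                   include.lowest = TRUE)

# Number of samples per group.
print(table(sugar.level))
```

The same pattern applies to any other variable that is later split into low, mid, and high levels.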

Brief introduction: ‘chlorides’ represents the amount of salt in the red wine. From the article Chloride concentration in red wines: influence of terroir and grape type, we know that wine contains salts of mineral acids, along with some organic acids, and that they may play a key role in a potential salty taste, with chlorides being a major contributor to saltiness. Moderate to large concentrations of chlorides and sodium might give the wine a salty flavor, which may turn away potential consumers.

Distribution: The chlorides distribution is also plotted on a logarithmic scale because the original distribution is highly right-skewed. Most samples fall between \(0.05\) \(g/dm^3\) and \(0.12\) \(g/dm^3\), with a center around \(0.08\) \(g/dm^3\).

Further interests: As with residual sugar, we would like to divide the data set into three saltiness levels, from low to high chlorides concentration, to further explore the relationship between red wine quality and chlorides.

Brief introduction: Total sulfur dioxide consists of the free and bound forms of sulfur dioxide. The free form exists in equilibrium between molecular sulfur dioxide (as a dissolved gas) and bisulfite ion. Meanwhile, sulphates contribute to sulfur dioxide gas levels. Sulfur dioxide prevents microbial growth and the oxidation of wine.

Distribution: After a log transformation, the three variables related to sulfur dioxide have approximately normal distributions. The free sulfur dioxide distribution has two peaks, around \(6\) \(mg/dm^3\) and around \(15\) \(mg/dm^3\), with most samples between \(3\) \(mg/dm^3\) and \(50\) \(mg/dm^3\). The total sulfur dioxide distribution is centered around \(45\) \(mg/dm^3\), with most samples between \(8\) \(mg/dm^3\) and \(140\) \(mg/dm^3\). The sulphates distribution is centered around \(0.58\) \(g/dm^3\), with most samples between \(0.4\) \(g/dm^3\) and \(1.2\) \(g/dm^3\).

Further interests: It would be interesting to examine how these three sulfur-related variables work together to influence red wine quality.

Brief introduction: The alcohol content of a red wine is one of the most important properties for its quality.

Distribution: The alcohol percentage has a long-tailed distribution, mainly from \(9\%\) to \(14\%\). There is no need to take the logarithm of alcohol percentage due to its small range.

Brief introduction: The quality level from poor to good is represented as integers from low to high.

Distribution: In the histogram, we observed that the majority of the red wine samples have quality levels 5 or 6, in the middle of the scale, which seems reasonable for the red wine market.

Univariate Analysis

What is the structure of your dataset?

This dataset has 1599 observations and 12 original variables. The variables can be divided into 4 groups:

  • variables linked to acids: ‘fixed.acidity’, ‘volatile.acidity’, ‘citric.acid’, ‘pH’

  • variables linked to other main features: ‘residual.sugar’, ‘alcohol’, ‘density’, ‘chlorides’

  • variables linked to additives: ‘free.sulfur.dioxide’, ‘total.sulfur.dioxide’, ‘sulphates’

  • main variable: ‘quality’

Summary of Univariate Analysis: From the preliminary analysis above, we observed that most of the red wine samples have an alcohol percentage from \(9\%\) to \(14\%\), pH from \(3.0\) to \(3.6\), residual sugar from \(1\) \(g/dm^3\) to \(3\) \(g/dm^3\), and total acidity from \(5\) \(g/dm^3\) to \(14\) \(g/dm^3\).

What is/are the main feature(s) of interest in your dataset?

In this analysis, the main feature of interest is ‘quality’.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Based on the background knowledge we gathered about red wine, we may intuitively speculate that the variables describing the main features of wine taste, such as ‘alcohol’, ‘residual.sugar’, ‘chlorides’, ‘pH’, ‘fixed.acidity’ and ‘volatile.acidity’, contribute most to red wine quality. The additive attributes, ‘citric.acid’ and the three variables related to sulfur dioxide, may also influence red wine quality to some degree.

Did you create any new variables from existing variables in the dataset?

Yes, ‘total.acidity’ represents the sum of the three different acid measurements in the red wine data set.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Many variables, such as ‘residual.sugar’, ‘alcohol’, ‘chlorides’, ‘free.sulfur.dioxide’, ‘total.sulfur.dioxide’ and ‘sulphates’, have long-tailed distributions. Except for ‘alcohol’, whose range is small, we transformed the x-axis to a ‘log10’ scale to make these distributions more symmetric.

Bivariate Plots Section

Here we investigate the relationships between pairs of variables using the correlation matrix below.
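
A correlation matrix of this kind can be computed with `cor()`. The sketch below is only an illustration on the ten rows printed in the structure listing, so its coefficients will not match those reported for the full data set. The text does not state which correlation method was used, so Pearson (the `cor()` default) is an assumption, as is the object name `wine`.

```r
# Ten rows of selected columns copied from the str() output.
wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  chlorides        = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, 0.065, 0.073, 0.071),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise correlation matrix (method is an assumption).
corr <- cor(wine, method = "pearson")
print(round(corr, 2))

# Correlations with quality, strongest positive first.
quality.corr <- sort(corr[, "quality"], decreasing = TRUE)
print(round(quality.corr, 2))
```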

From above, we can draw conclusions:

Now, we start to explore relationships between ‘quality’ and other variables.

Volatile acids and quality:

The correlation coefficient between ‘quality’ and ‘volatile.acidity’ is \(-0.35\), a moderate negative association: higher volatile acidity tends to go with lower red wine quality. Let’s take a closer look at the distribution of volatile acidity at each quality level.

We can observe that red wine samples of higher quality have a smaller range of volatile acidity. For low-quality red wine (levels 3 to 4), the mean and median of volatile acidity are above 0.6. For medium-quality red wine (levels 5 to 6), the median and mean are located in \((0.5,0.6)\). Meanwhile, high-quality red wine (levels 7 to 8) has a median and mean around 0.5.
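
The per-band summary described above can be sketched with `cut()` and `tapply()`. The bands (3 to 4, 5 to 6, 7 to 8) come from the text; the object and column names are assumptions, and the ten sample rows from the structure listing contain no low-quality wines, so that band is empty here.

```r
# Ten rows copied from the str() output.
wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Quality bands: (2,4] = low, (4,6] = mid, (6,8] = high.
wine$quality.band <- cut(wine$quality,
                         breaks = c(2, 4, 6, 8),
                         labels = c("low", "mid", "high"))

# Mean and median volatile acidity per band; empty bands give NA.
band.mean   <- tapply(wine$volatile.acidity, wine$quality.band, mean)
band.median <- tapply(wine$volatile.acidity, wine$quality.band, median)
print(rbind(mean = band.mean, median = band.median))
```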

Recall that the distribution of ‘volatile.acidity’ appeared bimodal in the previous analysis. This may be because a variable that influences volatile acidity had not yet been taken into account. Since ‘sulphates’ has a relatively high correlation coefficient with volatile acidity, we plot the distribution of volatile acidity at two different sulphates levels.

From above, we can see that the sulphates level plays a critical role in identifying the different volatile acidity populations.
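
The split into two sulphates levels can be sketched as below. The text does not give the cutoff between the two levels, so the median split is an assumption, as are the object names; the values are the ten rows from the structure listing.

```r
# Ten rows copied from the str() output.
wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80)
)

# Assumed rule: split at the median sulphates value.
cutoff <- median(wine$sulphates)
wine$sulphates.level <- factor(
  ifelse(wine$sulphates <= cutoff, "low", "high"),
  levels = c("low", "high")
)

# Volatile acidity values in each sulphates group.
by.level <- split(wine$volatile.acidity, wine$sulphates.level)
group.median <- sapply(by.level, median)
print(sapply(by.level, length))
print(group.median)
```

On the full data set, each element of `by.level` could be passed to `density()` or `hist()` to check whether each subgroup has a single mode.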

Balance of sourness and sweetness:

Since sourness and sweetness are two important tastes in red wine quality, we want to explore the influence of these two tastes.

From these two boxplots, we noticed that total acidity, representing sourness, and residual sugar, representing sweetness, have similar distribution patterns. From level 3 to level 7, we observed an increasing trend in both sourness and sweetness. However, we also observed a drop in both from level 7 to level 8. Since sweetness can be masked by sourness in red wine, it is reasonable to compute the ratio of total acidity to residual sugar. The result is shown as follows.

This plot makes more sense than the previous two. We observed that for quality levels higher than 4, the median and mean of this ratio are almost the same, around \(3.75\). This is further illustrated in the right figure, where red wines with quality level higher than 4 (the ‘mid’ and ‘high’ groups) have linear regression lines with similar slopes. This might be a reasonable ratio between sourness and sweetness.
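
The ratio itself is a one-line computation. The sketch below uses the ten rows from the structure listing, with ‘total.acidity’ computed as the sum of the three acid columns as defined earlier; the column name `acid.sugar.ratio` is an assumption. Values from such a small sample will not reproduce the \(3.75\) reported for the full data.

```r
# Ten rows copied from the str() output.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  residual.sugar   = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Total acidity as the sum of the three acids, then the sour/sweet ratio.
wine$total.acidity    <- with(wine, fixed.acidity + volatile.acidity + citric.acid)
wine$acid.sugar.ratio <- wine$total.acidity / wine$residual.sugar

# Median ratio at each quality level present in the sample.
ratio.median <- tapply(wine$acid.sugar.ratio, wine$quality, median)
print(round(ratio.median, 2))
```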

Chlorides and quality:

As discussed above, moderate to large concentrations of chlorides and sodium might give the wine a salty flavor, which may turn away potential consumers. In addition, the correlation between ‘chlorides’ and ‘quality’ is \(-0.161\). Hence, we want to explore the relationship between chlorides and quality.

By checking the distribution of chlorides at each quality level, we noticed that red wine at the lowest quality level (level 3) tends to have a larger range of chlorides and the highest mean. At quality levels 4 to 7, chlorides has a similar distribution for the majority of the red wine samples. At the highest level (level 8), red wine has the smallest range of chlorides and the lowest mean.

Sulfur dioxide:

Since the correlation coefficient between ‘free.sulfur.dioxide’ and ‘quality’ is only \(-0.0723\), we only plotted the distributions of ‘total.sulfur.dioxide’ and ‘sulphates’ at each quality level. From the left figure, we noticed that total sulfur dioxide increases from level 3 to level 6 and then decreases from level 6 to level 8. An intuitive speculation would be that medium-quality red wine uses a larger amount of sulfur dioxide as an antimicrobial, while high-quality red wine might rely on other, better antimicrobial methods rather than adding sulfur dioxide. From the right figure, we see a monotonically increasing trend in sulphates as the quality level goes up. This suggests that sulphates are associated with good red wine quality.

Alcohol:

Alcohol has the highest correlation coefficient with red wine quality, \(0.495\). Hence it is the most relevant single attribute for red wine quality.

As you can see, alcohol percentage does not clearly track quality among low-quality red wines (levels 3 to 4). One possible reason is that low-quality red wine did not go through natural fermentation and the winemaker instead added alcohol artificially. For quality levels 5 to 8, quality increases roughly linearly with alcohol percentage.

Density:

Density is another main attribute related to red wine quality, although it does not vary widely across red wines. The variable ‘density’ still has a noticeable correlation coefficient of \(-0.177\), which suggests that red wine with higher density tends to have somewhat lower quality.

We noticed that red wine density decreases as quality increases in most cases, except for an increasing trend from level 4 to level 5. This inconsistency divides the red wines into two groups: a bad-quality group (levels 3 to 4) and a good-quality group (levels 5 to 8). The same division also suits the analysis of ‘alcohol’, ‘total.sulfur.dioxide’ and ‘total.acidity’. With this division, the attributes we analyzed show a monotonically increasing or decreasing trend as quality goes up within each quality group.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The main feature of interest in this analysis is ‘quality’. In this section, we found that ‘alcohol’ has the strongest linear correlation with ‘quality’, with coefficient \(0.495\). However, low-quality red wine does not necessarily have a low alcohol percentage, which might be a result of adding alcohol artificially.

‘volatile.acidity’ and ‘chlorides’ create unpleasant tastes in red wine: the less ‘volatile.acidity’ and ‘chlorides’ a red wine has, the better its quality tends to be. ‘density’ is also negatively associated with red wine quality. Red wine with lower density would have a clearer texture and better quality.

We also found that it makes more sense to consider the ratio between ‘total.acidity’ and ‘residual.sugar’ when analyzing red wine quality than to consider them separately. Red wines of medium to high quality have a similar ratio, around \(3.75\).

For antimicrobials, ‘sulphates’ is monotonically associated with better red wine quality, while ‘total.sulfur.dioxide’ needs to be kept at a balanced value to avoid decreasing quality by adding too much sulfur dioxide.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

We observed that ‘volatile.acidity’ has a bimodal distribution. After dividing the whole population into two subgroups according to the ‘sulphates’ level, each subgroup has a distribution with only one mode. This suggests that the ‘sulphates’ level is strongly related to volatile acidity.

What was the strongest relationship you found?

The strongest relationship is between ‘quality’ and ‘alcohol’, which seems reasonable for red wine quality.

Multivariate Plots Section

In this section, we further explore the relationship between red wine quality and related factors. Here we look at how the relationship between red wine quality and alcohol changes under the influence of other factors.

Remark: for better intuitive understanding:

As you can see from above, red wine with a higher alcohol percentage is more likely to have better quality. For red wine with quality level less than 6, the bulk of the data accumulates around an alcohol percentage of 10. As quality improves, the bulk of the data for each level moves further to the right, which represents a higher alcohol percentage. We are interested in whether this pattern holds in different groups of red wine samples.

Three different red wine density groups

From the two plots above, we again confirmed our previous conclusion that red wine with a lower density level has a larger proportion of high-quality samples than red wine with higher density levels. Furthermore, from the second plot we can see that the basic pattern of the relationship between quality and alcohol remains in all three density groups.

Three different volatile acids density groups

From the two plots above, it is not hard to find that volatile acids do influence red wine quality in a negative way. In the second plot we see that the basic pattern remains in each group.

Three different chlorides density groups

These two plots lead to a conclusion similar to the one above: the basic pattern of the relationship between quality and alcohol remains, and chlorides give red wine an unpleasant smell and hence decrease red wine quality.

More complex patterns

Instead of analyzing only the basic pattern of the relationship between red wine quality and alcohol, we would like to check more complex patterns that combine one more related factor across different sample subgroups.

We can draw the following conclusions from the above plot:

  • In each subgroup, the density pattern does not change, which means the red wine with lower density level tends to have better quality and higher alcohol percentage.

  • Different chlorides density levels influence red wine density significantly while the different volatile acids density levels do not.

However, volatile acidity does not show the same pattern in every subgroup. In the three subgroups in the last row, the data points appear to be scattered randomly rather than following the trend that red wine with lower volatile acidity has better quality. One reasonable speculation is that when red wine has high density, it is hard to detect the unpleasant taste created by volatile acids. It might be covered by other tastes.

In this plot, the pattern of the chlorides level is only significant in the first column. In the mid and high red wine density groups, red wine with a low level of chlorides does not show a high level of quality. The reason might be the same as above: as red wine density increases, the unpleasant smell of chlorides might be covered by other smells or tastes.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

In this section, we explored the relationship between ‘quality’ and ‘alcohol’ along with three related variables, ‘density’, ‘chlorides’ and ‘volatile.acidity’. First of all, we already know from the last section that ‘quality’ is positively correlated with ‘alcohol’. Then, from the observations in this section, we noticed that in the mid and high red wine density subgroups, the unpleasant smells or tastes created by chlorides and volatile acids might be covered by other tastes, so that the patterns of the chlorides and volatile acidity levels change in those subgroups.

Were there any interesting or surprising interactions between features?

The red wine density level has a stable pattern across different subgroups. Hence ‘density’ is a critical attribute for describing red wine quality. However, the chlorides and volatile acidity levels do not hold a stable pattern across subgroups. Their influence on red wine quality is more obvious in the low-density red wine subgroups.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

Plot 1 suggests why the volatile acidity distribution is bimodal. From this plot we observed that when we divide the red wine samples into two subgroups by sulphates level, the volatile acidity distribution in each subgroup has only one mode. Thus the sulphates level is clearly related to volatile acidity.

Plot Two

Description Two

In the second plot, we explored the balance between sourness and sweetness in red wine. In the left figure, we use a box plot to show the distribution of the ratio between total acidity and residual sugar at each quality level. In the right figure, we use a scatter plot and linear regression to explore the relationship between total acidity and residual sugar. From both figures, we can conclude that red wine with quality level higher than 4 has a relatively stable ratio between total acidity and residual sugar.

Plot Three

Description Three

The last plot shows that red wine density relates to red wine quality in the same way across different subgroups. From this we can infer that red wine density is a critical feature for judging red wine quality.


Reflection

This exploratory data analysis is about a dataset containing information about different ingredients of red wine samples. The dataset has 1599 observations and 12 variables, not counting the index. In the initial phase, I explored each variable through univariate analysis. In this preliminary exploration, I noticed that some of the variables have long-tailed distributions. Hence I transformed the x-axis to a ‘log10’ scale and verified that the log-scale distribution is bell-shaped and symmetric. I noticed that quite a few red wine samples have a ‘citric.acid’ value equal to 0. This made me start to wonder whether citric acid is a good additive in red wine, and what the effect of other additives is. After examining the correlation matrix, I decided to focus on exploring, in pairs, the variables whose correlation coefficients with ‘quality’ are higher than 0.2. Next I compared those pairs together with other variables.

The main findings can be summarized as follows. In the bivariate analysis, I obtained two surprising observations. The first is that for red wine of medium and high quality, the ratio between sourness and sweetness is stable. Namely, total acidity and residual sugar have an almost fixed ratio in these red wine samples. The second is that volatile acids create unpleasant tastes in red wine, and that lower volatile acidity is associated with higher sulphates levels. Furthermore, I found that the influence of different variables on red wine quality differs between low-quality and higher-quality red wine samples, and that ‘alcohol’ has the strongest linear correlation with ‘quality’. Next, in the multivariate analysis, I noticed that ‘density’ is a critical feature of red wine quality, because the patterns of red wine density in different subgroups share the same trend: red wine with lower density is more likely to be at a better quality level. Meanwhile, ‘volatile.acidity’ and ‘chlorides’ do not influence red wine quality in a significant way if the red wine has medium to high density. The unpleasant smells and tastes created by these two ingredients can be covered by other tastes. In short, I found two important features for judging red wine quality: ‘alcohol’ and ‘density’. There are also other related features that influence red wine quality: ‘chlorides’, ‘sulphates’, ‘volatile.acidity’ and the ratio between ‘total.acidity’ and ‘residual.sugar’.

Further interests based on this analysis could be as follows: